Abstract

In this paper, we document our efforts to extend our statistical question answering system for TREC-11. We incorporated a web search feature, novel extensions of statistical machine translation, and lexical patterns for exact answers extracted from a supervised corpus. Without modification to our base set of thirty-one categories, we achieved a confidence weighted score of 0.455 and an accuracy of 29%. After the evaluation, we improved our model for selecting exact answers by insisting on exact answers in the training corpus; this resulted in a 7% relative gain in confidence weighted score on TREC-11 but a much larger gain of 46% on TREC-10.


1  Introduction 

TREC evaluations in Question Answering provide a useful application benchmark: by scoring the integration of a number of component technologies, they allow validation of components for which separate evaluation criteria are absent. Our approach since TREC-9 has been to investigate a mathematical framework under which a useful solution for question answering could be produced. We present our model and its novel extensions below. For training our system, we collected a 4K question-answer corpus based on trivia questions and developed answer patterns for the TREC collection of documents. This corpus was used to drive a number of the components described below. It also allowed us to investigate weights on features such as the presence of the answer chunk in web documents and lexical patterns found in answers. We also describe our efforts after the evaluation to overcome the inexact answer problem and present results obtained since the evaluation.

In TREC-8 (Voorhees and Tice, 1999), the NLP community began the task of evaluating Question Answering systems and has in subsequent evaluations provided significant challenges to such systems. In TREC-9, the challenge was 50-byte answers, and in TREC-10 it was definitional questions and handling rejection. To address these challenges, systems have largely adopted an architecture of predicting the answer tag of the desired answer, using a document retrieval method to select relevant documents, and performing answer selection to obtain the target answer. In TREC-8, Srihari and Li (1999) obtained significant gains using an expanded class of entities (66). In TREC-9, improved performance was demonstrated by using boolean retrieval and feedback loops (Harabagiu et al., 2000). In TREC-10, use of a large number of patterns was shown to perform well for retrieving answers (Soubbotin, 2001). In TREC-11, the track agreed to several significant changes:

•  Exact  Answers 

•  Single  Answers 

•  Confidence-based  Ranking  of  Answers 

1.1  Exact  Answers 

Systems were required to return answers which contained only the desired answer. Extra words were not accepted, and their presence caused the answer to be judged inexact. Our approach to handling exact answers was to use the phrases spanned by our thirty-one named entity categories, as well as constituent phrases of the syntactic parse of the answer (Penn Treebank style), which satisfied the answer pattern for the question. The decision to use the syntactic parse based phrases caused our system to output a large number of answers which were judged inexact. We will describe some experiments in which we changed the decision to accept only those phrases which exactly satisfy the answer pattern. Our named entity categories do not capture the differences between dates and years; nevertheless, we decided to evaluate our system without modifying the named entity categories. The named entity tags are divided among five major categories:

Name Expressions: Person, Salutation, Organization, Location, Country, Product

Time Expressions: Date, Date-Reference, Time

Number Expressions: Percent, Money, Cardinal, Ordinal, Age, Measure, Duration

Earth Entities: Geological Objects, Areas, Weather, Plant, Animal, Substance, Attraction

Human Entities: Events, Organ, Disease, Occupation, Title-of-work, Law, People, Company-roles

1.2  Single  Answers 

In previous TREC evaluations, systems returned up to 5 answers per question. In TREC-11, only a single answer was returned for each question. For evaluating single answers, the criterion used in this evaluation was the accuracy of the system.

1.3  Confidence-based  Ranking  of  Answers

NIST changed the metric from the mean reciprocal rank (MRR) of previous TREC Q&A evaluations to the Uninterpolated Mean Average Precision, which we shall refer to as the confidence weighted score (CWS), defined as follows,

\[ \mathrm{CWS} = \frac{1}{N} \sum_{i=1}^{N} \frac{\#\,\text{correct up to question } i}{i} \]

where N is the number of questions and the questions are listed in order of decreasing system confidence. This metric gives more credit to questions answered correctly at the beginning of the list. We made no specific attempt to optimize this criterion and instead worked mostly on optimizing the accuracy of our system.
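To make the metric concrete, the following is a minimal Python sketch of the CWS formula above. It is not the official NIST scorer; the function name `confidence_weighted_score` and the input convention (a list of correct/incorrect judgments, already sorted from most to least confident answer) are assumptions made for illustration.

```python
from typing import Sequence


def confidence_weighted_score(ranked_correct: Sequence[bool]) -> float:
    """CWS for judgments listed from the most to the least confident answer."""
    n = len(ranked_correct)
    if n == 0:
        raise ValueError("at least one question is required")
    total = 0.0
    correct_so_far = 0
    for i, is_correct in enumerate(ranked_correct, start=1):
        if is_correct:
            correct_so_far += 1
        # Precision over the first i questions of the ranked list.
        total += correct_so_far / i
    return total / n
```

With two questions and one correct answer, the score is 0.75 when the correct answer is ranked first and 0.25 when it is ranked last, which shows how the metric rewards placing correct answers at the beginning of the list.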

2  TREC-11  System

We model the distribution p(c|q,a), which attempts to measure the 'correctness', c, of the answer a to the question q. The variable c takes the value 0 or 1, indicating an incorrect or a correct answer respectively. We introduce a hidden variable e representing the class of the answer (answer tag/named entity) as follows,

\[ p(c \mid q,a) = \sum_{e} p(c,e \mid q,a) = \sum_{e} p(c \mid e,q,a)\, p(e \mid q,a) \qquad (1) \]

The terms p(e|q,a) and p(c|e,q,a) correspond to the familiar answer tag problem and answer selection problem, respectively. Instead of summing over all entities, as a first approximation we consider only the top entity predicted by the answer tag model and then find the answer that maximizes p(c|e,q,a).
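The top-entity approximation can be sketched as follows. This is an illustrative outline only: the two models are passed in as hypothetical callables (`tag_model`, `selection_model`), and the sketch assumes the answer tag is predicted from the question alone, which is a simplification of p(e|q,a).

```python
from typing import Callable, Dict, Iterable, Tuple

TagModel = Callable[[str], Dict[str, float]]        # q -> {tag e: probability}
SelectionModel = Callable[[str, str, str], float]   # (e, q, a) -> p(c=1 | e, q, a)


def select_answer(
    question: str,
    candidates: Iterable[str],
    tag_model: TagModel,
    selection_model: SelectionModel,
) -> Tuple[str, str, float]:
    """Pick the top answer tag, then the candidate maximizing p(c=1 | e, q, a)."""
    tag_probs = tag_model(question)
    # First approximation: keep only the most probable tag instead of summing over e.
    top_tag = max(tag_probs, key=tag_probs.get)
    scored = [(selection_model(top_tag, question, a), a) for a in candidates]
    if not scored:
        raise ValueError("no answer candidates")
    best_prob, best_answer = max(scored, key=lambda pair: pair[0])
    return best_answer, top_tag, best_prob
```

Summing over all tags, as in equation (1), would replace the single `top_tag` with a weighted sum of selection scores; the sketch keeps only the first approximation described in the text.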

The distribution p(c|e,q,a) is modeled using the maximum entropy framework described in (Berger et al., 1996). We built on the model we used last year, whose features are described in (Ittycheriah et al., 2001). The new features we investigated this year are:

•  Occurrence of the answer candidate on the web

•  Re-ranking of the answer candidate window using a statistical MT dictionary

•  Lexical patterns from supervised training pairs

This year we submitted 3 runs, two of which measured the effectiveness of the first feature type. The last run was a feedback loop on the first run, in which we included the answer strings of questions answered with sufficient confidence in order to further improve their confidence. The results are presented below in Table 1. We also provided the output of system ‘ibmsqa02a’ to another group at IBM for the run labeled IBMPQSQA. The integration of our system’s output with their question answering system improved their base accuracy from 33.8% to 35.6%, a relative improvement of 5.3%, and their CWS from 0.534 to 0.586 (Chu-Carroll et al., 2002).

2.1  Training  Data 

We used TREC-8, TREC-9, and 4K questions from our KM database to train the model this year. This corpus represents an order of magnitude increase over the training data we used last year. For each question we developed a set of answer patterns by judging several potential answer sentences in the TREC corpus. Using the answer patterns and sentences derived from the TREC corpus, we automatically labelled chunks as being correct or incorrect. The total number of chunks used in formulating the model was 207K: 30K instances of correct answers (of which 10K were inexact) and 177K incorrect chunks.

2.2  Web  Feature 

The web feature was used by a number of groups last year (Clarke et al., 2001; Brill et al., 2001), and we attempted to measure its impact on our system. We incorporated the feature as two indicators: (1) occurrence of the answer candidate in the top 10 documents retrieved from the web, and (2) a count of the number of times the answer candidate occurred. This feature type, together with performing no rejection, is the difference between the runs ibmsqa02a and ibmsqa02b. After removing the 12 correctly rejected questions, we note an improvement of only 7 questions from using the web-based feature. Our systems have traditionally used an encyclopaedia for LCA based expansion, and this may explain why the web feature is less effective in our system. We refer to this method of using the web as Answer Verification to differentiate it from other approaches which attempt to answer the question on the web and then look in the target document corpus for the same answer. The latter method can result in unsupported answers. We note that the number of unsupported answers is not significantly different between runs ‘ibmsqa02a’ and ‘ibmsqa02b’ (11 vs. 8), but when we used the answer strings of confident questions as feedback in the run ‘ibmsqa02c’, the number of unsupported answers went up significantly (to 18).
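The two web indicators can be illustrated with a short Python sketch. It assumes the web documents have already been retrieved and are available as plain strings, and it uses case-insensitive substring matching; the matching rule, the function name `web_features` and its parameters are assumptions, since the text does not specify how occurrences were detected.

```python
from typing import List, Tuple


def web_features(candidate: str, web_docs: List[str], top_k: int = 10) -> Tuple[bool, int]:
    """Return (occurs in the top-k web documents, number of occurrences)."""
    needle = candidate.lower()
    if not needle:
        return False, 0
    # Count non-overlapping occurrences of the candidate in each retrieved document.
    count = sum(doc.lower().count(needle) for doc in web_docs[:top_k])
    return count > 0, count
```

Both values would then be exposed to the answer selection model as features, so that the model learns how much weight web support deserves rather than having the web override the corpus evidence.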

2.3  Statistical  Machine  Translation  Thesaurus

Generally, an answer to a fact-seeking question can be decomposed as

\[ a = a_d + a_s \qquad (2) \]

where a_d is the desired answer and a_s is the supporting evidence for the answer. Although words comprising the answer support are generally found in the question, words such as the focus of the question are sometimes deleted in the answer. Following our general approach of learning phenomena from training data, we used our question-answer corpus to train a Model 1 translation matrix (Brown et al., 1993). Questions were tokenized with casing information folded, and answers were both tokenized and named entity tagged. A question-answer pair is presented below before and after the preprocessing.

Q: How tall is Mt. Everest?
A: He started with the highest , 29,028 - foot Mt. Everest , in 1984

Q: how tall is mt. everest ?
A: he started with the highest , 29,028 - foot mt. everest , in 1984 measure_ne

System     Description          CWS    Right        Inexact  Unsup  Wrong        Rej
ibmsqa02a  Base system          0.454  140 (28%)    37       11     312 (62.4%)  12/83
ibmsqa02b  No web or rejection  0.403  121 (24.2%)  43       8      328 (65.6%)  0
ibmsqa02c  Feedback loop        0.455  145 (29%)    44       18     293 (58.6%)  11/49

Table 1: Performance on TREC-11.

We had 4K training pairs from the KM trivia database, 1.6K pairs from TREC8 and 10.7K pairs from TREC9. The latter were derived from correct judgements given to questions in those evaluations and which also came from unique sentences in the corpus. This data was split into two and separate translation models were derived. Entries which occurred in both translation models were retained; a few of the more interesting entries are shown below in Table 2. Each word is shown with its 5 top translation candidates. For the word “who”, the model prefers to see the named entity tag “person_ne” with a relatively high probability. Even though the number of translation pairs is small (16.3K pairs), for the question answering application we are interested in only the most common words, which are potentially modified in the translated output of the question; rarer words have to appear identical to the form in the question. Using this additional thesaurus resource, we re-ranked the answer candidate windows (windows of text bounded by the question terms and the answer candidate) and quantized the rank into 5 bins (1, 2, high, mid and low) for use in the maximum entropy answer selection module. We have not separately investigated the effect of this ranking, so details will be presented in the future.
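For readers unfamiliar with Model 1, the following Python sketch shows one way a translation table t(a|q) could be estimated with EM and how entries shared by two independently trained tables could be retained. It is a simplified illustration rather than the system's implementation: it omits the NULL word of Brown et al. (1993), and the pruning threshold `min_prob` and the averaging of the two probabilities in `shared_entries` are assumptions, as are all function names.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[List[str], List[str]]      # (question tokens, answer tokens)
Table = Dict[str, Dict[str, float]]     # t[q][a] approximates t(a|q)


def train_model1(pairs: List[Pair], iterations: int = 10) -> Table:
    """EM for IBM Model 1 without a NULL word (a simplification)."""
    answer_vocab = {a for _, a_tokens in pairs for a in a_tokens}
    if not answer_vocab:
        raise ValueError("need at least one non-empty answer")
    uniform = 1.0 / len(answer_vocab)
    t: Table = defaultdict(lambda: defaultdict(lambda: uniform))
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        for q_tokens, a_tokens in pairs:
            for a in a_tokens:
                # E-step: share one count for `a` among the question words.
                norm = sum(t[q][a] for q in q_tokens)
                for q in q_tokens:
                    frac = t[q][a] / norm
                    count[q][a] += frac
                    total[q] += frac
        # M-step: renormalize so each row sums to one over answer tokens.
        t = {q: {a: c / total[q] for a, c in row.items()} for q, row in count.items()}
    return t


def shared_entries(t1: Table, t2: Table, min_prob: float = 0.01) -> Table:
    """Keep (q, a) entries present in both tables; average their probabilities."""
    merged: Table = {}
    for q in t1.keys() & t2.keys():
        row = {a: 0.5 * (p + t2[q][a]) for a, p in t1[q].items()
               if a in t2[q] and min(p, t2[q][a]) >= min_prob}
        if row:
            merged[q] = row
    return merged


def top_candidates(t: Table, q: str, k: int = 5) -> List[Tuple[str, float]]:
    """The k most probable answer-side tokens for question word q."""
    return sorted(t.get(q, {}).items(), key=lambda kv: -kv[1])[:k]
```

Training on two halves of the data and keeping only the shared entries acts as a simple noise filter: an entry that arises from a chance co-occurrence in one half is unlikely to survive in the other.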

2.4  Answer  Patterns 

The approach described in (Soubbotin, 2001) uses patterns for locating answers. In related work, Ravichandran and Hovy (2002) have shown how to extract patterns in an unsupervised manner from the web. In this work, we use the supervised corpus of questions and answers to extract n-grams occurring in the answer. To specialize the pattern for a particular question type, the question was represented only by the question word and the first word to its right. To generalize the answer candidate window, it was modified to replace all non-stop question words with “<queryTerm>” and the answer candidate with “<answer>”. So for the example above,

QF: how tall
MW: he started with the highest , <answer> <queryTerm> measure_ne

where QF stands for the question focus and MW stands for the mapped answer candidate window. Ideally, the question would be represented by more than just the word adjacent to the question word, but in most cases this suffices. To overcome some of the limitations of this choice, we also chose features relating the predicted answer tag and an answer pattern. An answer pattern consists of 5-grams or larger chosen with a count cutoff. The total number of pattern features incorporated was 8.5K out of 15.3K features.
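The mapping from a question and an answer window to QF and MW can be sketched in Python as below. Several details are assumptions made for illustration and are not given in the text: the stop list, the list of question words, the collapsing of adjacent query terms into a single “<queryTerm>”, and the fact that the window is passed in already trimmed to the span bounded by the question terms and the answer candidate.

```python
from collections import Counter
from typing import Iterable, List, Sequence

# Illustrative lists only; the actual stop list is not specified in the text.
QUESTION_WORDS = {"how", "what", "who", "when", "where", "which", "why"}
STOP_WORDS = QUESTION_WORDS | {"is", "was", "the", "a", "of", "in", "?"}


def question_focus(q_tokens: Sequence[str]) -> List[str]:
    """Question word plus the first word to its right."""
    for i, tok in enumerate(q_tokens):
        if tok in QUESTION_WORDS:
            return list(q_tokens[i:i + 2])
    return list(q_tokens[:1])


def map_window(window: Sequence[str], q_tokens: Sequence[str],
               answer: Sequence[str]) -> List[str]:
    """Replace the answer candidate and non-stop question words by placeholders."""
    query_terms = {t for t in q_tokens if t not in STOP_WORDS}
    window, answer = list(window), list(answer)
    mapped: List[str] = []
    i = 0
    while i < len(window):
        if answer and window[i:i + len(answer)] == answer:
            mapped.append("<answer>")
            i += len(answer)
        elif window[i] in query_terms:
            # Assumption: a run of adjacent query terms becomes one placeholder.
            if not mapped or mapped[-1] != "<queryTerm>":
                mapped.append("<queryTerm>")
            i += 1
        else:
            mapped.append(window[i])
            i += 1
    return mapped


def count_patterns(windows: Iterable[Sequence[str]], min_n: int = 5) -> Counter:
    """Count every n-gram of length >= min_n; a count cutoff can be applied afterwards."""
    counts: Counter = Counter()
    for w in windows:
        for n in range(min_n, len(w) + 1):
            for i in range(len(w) - n + 1):
                counts[tuple(w[i:i + n])] += 1
    return counts
```

Applied to the Mt. Everest example with the window "he started with the highest , 29,028 - foot mt. everest measure_ne" and the answer "29,028 - foot", `map_window` reproduces the MW string shown above, and `question_focus` returns "how tall".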

3  Answer  Selection 

Answer selection was performed as in previous years with minor modifications. First, a fast-match technique for selecting answer sentences is used and the top 100 sentences are selected. On TREC-10, this phase yields sentences containing the answer pattern for 80% of the questions. Considering the approximately 10% of questions which were to be rejected in TREC-10, the error of the sentence selector is about 10% with a list size of 100 sentences.

In order to select exact answers, we extracted all parse nodes which were noun phrases; together with all phrases which were named entities, these formed a candidate pool. As mentioned before, our system suffered from a large number of answers judged inexact, and these were mostly due to the decision to accept any phrase thus selected which contained an answer pattern. Below we discuss some experiments in which a phrase is considered correct only if it contains only the answer pattern.

q        Top translation candidates a, with t(a|q)
Haiti    haiti (0.076), port-au-prince (0.048), miami (0.034), people (0.021), haitian (0.018)
who      person_ne (0.125), [unreadable] (0.010), the (0.051), [unreadable] (0.046), [unreadable] (0.042)
river    river (0.217), the (0.081), water (0.060), location_ne (0.039), many (0.028)
tall     measure_ne (0.056), foot (0.041), feet (0.027), - (0.017), i (0.012)
nuclear  nuclear (0.183), atomic (0.020), at (0.013), soviet (0.010), site (0.010)
team     team (0.099), organization_ne (0.056), game (0.030), 5 (0.029), their (0.023)

Table 2: Translation entries for some question words.


For the training corpus of chunks with their labeled decision of correct or incorrect, we formulated features such as whether the desired named entity was found in the chunk. The features described above were added to the base model described in (Ittycheriah et al., 2001) and weights were derived using the maximum entropy algorithm. For a typical answer candidate, 50-100 features fire for each decision. The answer candidate that has the highest probability is chosen for the output.
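As a rough illustration of how a maximum entropy model turns firing features into a probability of correctness, consider the Python sketch below. The feature names and weights are hypothetical, and the sketch assumes binary features tied to an outcome c in {0, 1}; it shows only the scoring step, not the training procedure used to derive the weights.

```python
import math
from typing import Dict, Iterable, Tuple

Feature = Tuple[str, int]   # (feature name, outcome c with which it fires)


def p_correct(active: Iterable[str], weights: Dict[Feature, float]) -> float:
    """p(c=1 | active features) under a conditional maximum entropy model."""
    names = list(active)
    score = {c: sum(weights.get((name, c), 0.0) for name in names) for c in (0, 1)}
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(score.values())
    z = sum(math.exp(s - m) for s in score.values())
    return math.exp(score[1] - m) / z


def best_candidate(candidates: Dict[str, Iterable[str]],
                   weights: Dict[Feature, float]) -> str:
    """Return the candidate whose active features give the highest p(c=1)."""
    return max(candidates, key=lambda a: p_correct(candidates[a], weights))
```

With no applicable weights the model returns 0.5, and each firing feature shifts the probability up or down according to its weight, which is how evidence such as a named entity match or a web occurrence is combined in a single score.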

4  Rejection 

For questions which are determined to have no answer in the corpus, the system was supposed to return ‘NIL’ as the document id. To determine which questions to reject, we thresholded the distribution p(c|q,a). However, the system sometimes encounters events which are not sufficiently represented in the training corpus, and to allow some level of control it was useful to smooth this probability with a decreasing function of chunk rank. This smoothed estimate was computed as

\[ p^{*} = (1 - \alpha)\, p(c \mid q,a) + \alpha\, (1 - 0.1 \cdot \text{chunk\_rank}) \]

where chunk_rank was saturated at 10. This year α was set to 0.2 and the rejection threshold to 0.3. The rejection threshold was optimized on the accuracy of TREC-10 questions using the TREC corpus of documents. We plot in Figure 1 the cumulative distribution function of scores for questions with answers in the corpus and also 1.0 minus the cumulative distribution function for questions which should be rejected. The plot is for TREC-10 questions using the TREC corpus of documents for answers. We expected to reject about 80 questions in our base system, and the actual run did approximately the same (83 rejections). The feedback loop of ibmsqa02c reduced the number of rejections to 49, and thus the precision of rejections improved from 0.145 to 0.224 while approximately maintaining the recall (12 versus 11 correct rejections).
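The smoothing and thresholding step can be written as a few lines of Python. The sketch assumes that chunk_rank starts at 1 for the top-ranked chunk and that a question is rejected when the smoothed score falls strictly below the threshold; neither convention is stated in the text, and the function names are illustrative.

```python
def smoothed_confidence(p: float, chunk_rank: int, alpha: float = 0.2) -> float:
    """p* = (1 - alpha) * p + alpha * (1 - 0.1 * chunk_rank), rank saturated at 10."""
    rank = min(chunk_rank, 10)
    return (1.0 - alpha) * p + alpha * (1.0 - 0.1 * rank)


def should_reject(p: float, chunk_rank: int,
                  alpha: float = 0.2, threshold: float = 0.3) -> bool:
    """Return True when the system should answer NIL for this question."""
    return smoothed_confidence(p, chunk_rank, alpha) < threshold
```

For example, a top-ranked chunk with p(c|q,a) = 0.5 receives a smoothed score of 0.58 and is kept, whereas a chunk at rank 20 with p(c|q,a) = 0.1 receives 0.08 and leads to rejection. The rejection precision figures quoted above follow directly from Table 1: 12/83 is about 0.145 and 11/49 is about 0.224.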

5  Analysis  &  Subsequent  Experiments

One method of characterizing a test set is with respect to a set of answer tags. The primary difference between TREC-10 and TREC-11 is in the composition of the answer tags, and these are presented for the top set of tags in Figure 2. The drastic difference between the test sets is in the number of questions classified as PHRASE (this class represents any question which does not fall into the other categories). This result reflects the reduction in definitional questions and the emphasis on exact answer questions; however, it also means that calibrating rejection rates and system strategies on TREC-10 is mismatched with the evaluation.

Figure 1: TREC-10 scores for normal and rejection questions (horizontal axis: output score).

In order to overcome the excessive number of inexact answers produced by our system, we trained the model with only those phrases which exactly matched the answer pattern labelled as correct. This is only a partial solution, since some answers are now considered incorrect when they seem quite reasonable. For example, in the first question of TREC-8, the answer “Hugo Young” is now considered incorrect since the answer pattern contains only “Young”. This exact match reduced the number of correct training instances by about 33% (from 30K to 20K, where the total number of training instances is 207K); inspection of the removed instances indicates that (a) some exact answers are now labelled incorrect, and (b) the majority of phrases containing the answer plus extra words are now labelled incorrect. The number of answers of type (a) is relatively small (estimated at about 10% of the chunks). We then calibrated the performance of our system on TREC-11 by using the answer patterns and modifying the scoring script to accept an answer only if it satisfies the test

if ($answer_str =~ /^(\s+)?$p(\s+)?$/i)

The results of the system using the answer patterns are generally lower than the official judgments, and each run seems to suffer by about the same amount. Table 5 shows the results of using the new model. We emphasize that these results are obtained using the perl patterns as opposed to the human judgments of the evaluation. In order to remove the effect of rejection, we modified the threshold (to 0.22 from 0.3) in the new model to output about the same number of rejected questions, so that the improvement in scores is not dominated by getting only rejection questions correct. The results indicate a 46% improvement in CWS on the TREC-10 test but only about a 7% gain on TREC-11. Investigating this discrepancy will be the subject of future work.
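For readers who prefer Python to Perl, the exact-match test used in the modified scoring script can be approximated as follows. This is a sketch rather than a line-for-line port: the pattern is wrapped in a non-capturing group so that alternations inside it stay anchored at both ends, which differs from plain interpolation of `$p` in the Perl expression, and the function name `is_exact_match` is illustrative.

```python
import re


def is_exact_match(answer_str: str, pattern: str) -> bool:
    """True if the answer is the pattern alone, ignoring case and outer whitespace."""
    regex = r"^(\s+)?(?:" + pattern + r")(\s+)?$"
    return re.search(regex, answer_str, flags=re.IGNORECASE) is not None
```

Under this test “Nov. 22 , 1963” is accepted for the pattern `Nov\. 22 , 1963`, while “shot on Nov. 22 , 1963” is not, and “Hugo Young” is rejected when the pattern is only `Young`, which mirrors the behaviour described in the text.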

Table 3 shows answers which were accepted by the evaluation system but are now training examples of incorrect answers. Examples of system output with the exact answer fix are shown in Table 4, together with the older strings to demonstrate the nature of the fix. The first two examples show answers which satisfy the answer patterns exactly at test time. The last two examples show errors by the system, but overall the system was able to produce more answers which satisfied the exact match criterion.

Figure 2: Comparison of answer tags between TREC-10 and 11 (horizontal axis: answer tags).

Question                                     Answer
What canine was made famous by Eric Knight?  Lassie Come - Home
Professor Moriarty was whose rival?          Sherlock Holmes’ nemesis
What is Francis Scott Key best known for?    write the Star Spangled Banner

Table 3: Training data instances which are rejected for the exact answer fix.

Qnum  Question                                                                 Old Answer                         Answer
1059  What peninsula is Spain part of?                                         position on the Iberian Peninsula  Iberian Peninsula
1215  When was President Kennedy shot?                                         shot on Nov. 22 , 1963             Nov. 22 , 1963
1316  What was the name of the plane Lindbergh flew solo across the Atlantic?  Spirit of St. Louis                Charles A.
1348  How cold should a refrigerator be?                                       28 degrees Farenheit               soda ice cold

Table 4: TREC-10 QA pairs before and after the exact answer fix.

System          Description       CWS    Right  Wrong  Rej
trec10-perl     Base system       0.289  92     408    10/76
trec10-perl     Exact answer fix  0.423  127    373    16/92
ibmsqa02a-perl  Base system       0.438  134    366    12/83
ibmsqa02a-perl  Exact answer fix  0.469  144    356    4/52

Table 5: Experimental results using perl-patterns since TREC-11 evaluation.

6  Conclusions  and  Future  Work 

In TREC-11, our method of selecting which candidates were exact answers did not satisfy the exact match criterion of the evaluation. We have since modified our system to extract exact answers and retrained it. We incorporated two novel concepts (a statistical machine translation thesaurus and lexical patterns derived from supervised question-answer pairs) since last year.

In TREC-11, we thresholded the distribution p(c|q,a) to reject answers, which we recognize as deficient in the following sense: we should recognize a question as not having an answer in the corpus by taking into consideration all the answers found and not just the top ranking answer.